In [1]:
import pandas as pd
import numpy as np
import plotly
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
Data Description¶
- VIN (1-10): The first 10 characters of the vehicle’s VIN (Vehicle Identification Number).
- County: The region where the vehicle’s owner is registered. Owners in Washington may have vehicles located in other states.
- City: The city where the owner lives.
- State: The state linked to the vehicle's registration.
- Postal Code: The 5-digit ZIP code of the owner’s address.
- Model Year: The year of the vehicle, decoded from the VIN.
- Make: The manufacturer of the vehicle, decoded from the VIN.
- Model: The vehicle’s model, decoded from the VIN.
- Electric Vehicle Type: Whether the vehicle is fully electric or a plug-in hybrid.
- CAFV Eligibility: Whether the vehicle qualifies as a Clean Alternative Fuel Vehicle (CAFV) based on Washington's 2019 House Bill 2042.
- Electric Range: How far the vehicle can go on electric power alone.
- Base MSRP: The lowest suggested retail price for the vehicle’s model.
- Legislative District: The Washington state legislative area where the owner lives.
- DOL Vehicle ID: A unique ID assigned by the Department of Licensing (DOL).
- Vehicle Location: The center of the vehicle's registered ZIP code.
- Electric Utility: The electric utility serving the owner’s address.
- 2020 Census Tract: The census tract code (state + county + tract) assigned by the U.S. Census Bureau.
In [2]:
df = pd.read_csv("dataset.csv")
df.head()
Out[2]:
| VIN (1-10) | County | City | State | Postal Code | Model Year | Make | Model | Electric Vehicle Type | Clean Alternative Fuel Vehicle (CAFV) Eligibility | Electric Range | Base MSRP | Legislative District | DOL Vehicle ID | Vehicle Location | Electric Utility | 2020 Census Tract | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | JTMEB3FV6N | Monroe | Key West | FL | 33040 | 2022 | TOYOTA | RAV4 PRIME | Plug-in Hybrid Electric Vehicle (PHEV) | Clean Alternative Fuel Vehicle Eligible | 42 | 0 | NaN | 198968248 | POINT (-81.80023 24.5545) | NaN | 12087972100 |
| 1 | 1G1RD6E45D | Clark | Laughlin | NV | 89029 | 2013 | CHEVROLET | VOLT | Plug-in Hybrid Electric Vehicle (PHEV) | Clean Alternative Fuel Vehicle Eligible | 38 | 0 | NaN | 5204412 | POINT (-114.57245 35.16815) | NaN | 32003005702 |
| 2 | JN1AZ0CP8B | Yakima | Yakima | WA | 98901 | 2011 | NISSAN | LEAF | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 73 | 0 | 15.0 | 218972519 | POINT (-120.50721 46.60448) | PACIFICORP | 53077001602 |
| 3 | 1G1FW6S08H | Skagit | Concrete | WA | 98237 | 2017 | CHEVROLET | BOLT EV | Battery Electric Vehicle (BEV) | Clean Alternative Fuel Vehicle Eligible | 238 | 0 | 39.0 | 186750406 | POINT (-121.7515 48.53892) | PUGET SOUND ENERGY INC | 53057951101 |
| 4 | 3FA6P0SU1K | Snohomish | Everett | WA | 98201 | 2019 | FORD | FUSION | Plug-in Hybrid Electric Vehicle (PHEV) | Not eligible due to low battery range | 26 | 0 | 38.0 | 2006714 | POINT (-122.20596 47.97659) | PUGET SOUND ENERGY INC | 53061041500 |
In [3]:
df.columns
Out[3]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
'Make', 'Model', 'Electric Vehicle Type',
'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
dtype='object')
Data Cleaning¶
Dropping the null values as they are less than 5% of missing values¶
In [4]:
df.isna().sum()/len(df)
Out[4]:
VIN (1-10) 0.000000 County 0.000000 City 0.000000 State 0.000000 Postal Code 0.000000 Model Year 0.000000 Make 0.000000 Model 0.000178 Electric Vehicle Type 0.000000 Clean Alternative Fuel Vehicle (CAFV) Eligibility 0.000000 Electric Range 0.000000 Base MSRP 0.000000 Legislative District 0.002539 DOL Vehicle ID 0.000000 Vehicle Location 0.000213 Electric Utility 0.003933 2020 Census Tract 0.000000 dtype: float64
In [5]:
df.dropna(inplace=True)
dropping the unwanted columns as they are contain values which are all unique and doesn't give any useful information for analysis¶
In [6]:
df.columns
Out[6]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
'Make', 'Model', 'Electric Vehicle Type',
'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
dtype='object')
In [7]:
df.shape
Out[7]:
(112152, 17)
In [8]:
df.nunique()
Out[8]:
VIN (1-10) 7522 County 39 City 435 State 1 Postal Code 516 Model Year 20 Make 34 Model 114 Electric Vehicle Type 2 Clean Alternative Fuel Vehicle (CAFV) Eligibility 3 Electric Range 101 Base MSRP 30 Legislative District 49 DOL Vehicle ID 112152 Vehicle Location 516 Electric Utility 73 2020 Census Tract 1760 dtype: int64
In [9]:
import re
def xed(x):
k,j = re.findall("\s.*",x)[0].strip(" (").strip(")").split(" ")
return (float(k),float(j))
In [10]:
xed(df["Vehicle Location"][2])
Out[10]:
(-120.50721, 46.60448)
In [11]:
df.columns
Out[11]:
Index(['VIN (1-10)', 'County', 'City', 'State', 'Postal Code', 'Model Year',
'Make', 'Model', 'Electric Vehicle Type',
'Clean Alternative Fuel Vehicle (CAFV) Eligibility', 'Electric Range',
'Base MSRP', 'Legislative District', 'DOL Vehicle ID',
'Vehicle Location', 'Electric Utility', '2020 Census Tract'],
dtype='object')
In [12]:
df["coordinates"] = df["Vehicle Location"].apply(lambda x: xed(x))
In [13]:
df.drop(columns=['DOL Vehicle ID','Vehicle Location'],inplace=True)
EDA¶
Univariate¶
In [14]:
df["County"].value_counts().head(10).plot(kind="bar",color="green")
plt.title("The top 10 counties by vechiles registered count")
plt.show()
> From the above we can see the top 10 counties where majority of people are using electronic vehicles as a transport means. The King county has the largest no.of vehicles registered.¶
In [15]:
df["Make"].value_counts().head(10).plot(kind="bar",color="red")
plt.title("The top 10 vehicle brands by vehicles in use count")
plt.show()
In [16]:
df["Make"].value_counts(ascending=True).head()
Out[16]:
Make TH!NK 3 BENTLEY 3 AZURE DYNAMICS 7 GENESIS 18 FISKER 19 Name: count, dtype: int64
> From the above we can observe that Tesla is dominating the electric vehicles industry with such a huge customer base, while some brands like THINK & BENTLY have such a low presence in the market.¶
In [17]:
df["Electric Vehicle Type"].value_counts().plot(kind="pie",color="yellow",autopct="%.2f",shadow=True,explode=[0.1,0])
plt.title("Distribution of Electric Vehicle types in-use")
plt.ylabel("")
plt.show()
> From the above we can observe that Battery based Electric Vehicles are more in use than Plug-in Hybrid based Electric Vehicles.¶
In [18]:
sns.histplot(df["Model Year"],kde=True,color="blue",bins=20)
plt.title("Distribution of vehicles manufactured year")
plt.show()
> From the above we can observe that majority of the people are using vehicles that were manufactured in the years: 2021 - 2022¶
In [19]:
sns.displot(df["Electric Range"],kde=True,color="purple")
plt.title("The distribution of the battery range of the vehicles")
plt.show()
In [20]:
df['Clean Alternative Fuel Vehicle (CAFV) Eligibility'].value_counts().plot(kind="pie",autopct="%.2f",explode=[0.02,0.02,0.01])
plt.title("Distribution of CAFV Eligibility")
plt.ylabel("")
plt.show()
In [21]:
df.groupby("Clean Alternative Fuel Vehicle (CAFV) Eligibility")["Electric Range"].agg(["min","max","mean","count"])
Out[21]:
| min | max | mean | count | |
|---|---|---|---|---|
| Clean Alternative Fuel Vehicle (CAFV) Eligibility | ||||
| Clean Alternative Fuel Vehicle Eligible | 30 | 337 | 163.822194 | 58395 |
| Eligibility unknown as battery range has not been researched | 0 | 0 | 0.000000 | 39097 |
| Not eligible due to low battery range | 6 | 29 | 19.363847 | 14660 |
> From the above, we can observe that around 48% of the cars have a battery range 0-30 which might be due to some vehicles not having the eligibility due to low battery percentage and others not having their battery range researched. The reason for that can be because of the vehicle being an old model which were manufactured before the electric vehicles era.¶
In [22]:
df["Model"].value_counts().head(10).plot(kind="barh",color="gold")
plt.title("The top 10 vehicle models by vehicles in-use count")
plt.show()
> From the above we can observe that MODEL 3, MODEL Y and LEAF are the models with highest demand in the market¶
In [23]:
df["Electric Utility"].value_counts().head(10).plot(kind="barh",color="navy")
plt.title("The top 10 power suppliers for the electric vehicle owners")
plt.show()
> From the above we can observe the top 10 electric power providers respective to the customers locations and usage.¶
In [24]:
temp = df[df["Base MSRP"]!=0] # removing the values 0 as the vehicles can't be sold for free. Considering that their values might be missing
In [25]:
sns.histplot(data=temp,x="Base MSRP",kde=True)
plt.title("The distribution of Base MSRP of the vehicles")
plt.show()
> From the above we can say that most of the vehicles have an Base MSRP of ranging between 0 to 100,000.¶
Bivariate¶
In [26]:
df.groupby("Make")["Model"].nunique().sort_values(ascending=False).head(10).plot(kind="barh",color="teal")
plt.title("No.of varaints available in each of the top 10 brands by count")
plt.show()
> From the above we can observe the top 10 brands which have a large range of variants available/being used in the market.¶
In [27]:
sns.histplot(data=df,x="Model Year",hue="Electric Vehicle Type",bins=20, multiple='stack')
plt.title("No.of vehicles manufactured in each year along with the type of electric vehicle")
plt.show()
> From the above plot we can observe that there was a cut in the manufacturing of vehicles in the year 2020 which might be due to the covid pandemic post which the production rose back up. The vehicles industry had a bloom in the 2019 and again in 2023.¶
In [28]:
df.groupby("Model")["Electric Range"].mean().sort_values(ascending=False).head(10).plot(kind="bar",color="olive")
Out[28]:
<Axes: xlabel='Model'>
> From the above we can observe the top 10 vehicle models with a high average electric range.¶
In [29]:
temp = df[df["Electric Range"]!=0] # removing the values 0 as the battery based vehicles can't be having 0 Range. Assuming that their values might be missing.
In [30]:
sns.violinplot(data=temp,x="Electric Range",hue="Electric Vehicle Type")
plt.title("No.of vehicles manufactured in each year along with the type of electric vehicle")
plt.show()
> From the above we can see that the plug-in type vehicles mostly have a range of 0-50, while the battery type vehicles has a wide distribution from 170 to 300.¶
Create a Choropleth using plotly.express to display the number of EV vehicles based on location.¶
In [31]:
df["County"].unique()
Out[31]:
array(['Yakima', 'Skagit', 'Snohomish', 'Island', 'Thurston', 'Grant',
'King', 'Kitsap', 'Whitman', 'Spokane', 'Cowlitz', 'Pierce',
'Kittitas', 'Grays Harbor', 'Clark', 'Chelan', 'Whatcom', 'Benton',
'Walla Walla', 'Mason', 'San Juan', 'Lewis', 'Jefferson',
'Clallam', 'Douglas', 'Klickitat', 'Skamania', 'Adams', 'Franklin',
'Okanogan', 'Stevens', 'Asotin', 'Ferry', 'Pacific', 'Columbia',
'Wahkiakum', 'Lincoln', 'Pend Oreille', 'Garfield'], dtype=object)
In [32]:
# df['County'] = df['County'] + " County"
temp = df.groupby("County")["VIN (1-10)"].count().reset_index()
temp.columns = ['County','EV Count']
county_fips = {
'Adams': '53001', 'Asotin': '53003', 'Benton': '53005', 'Chelan': '53007',
'Clallam': '53009', 'Clark': '53011', 'Columbia': '53013', 'Cowlitz': '53015',
'Douglas': '53017', 'Ferry': '53019', 'Franklin': '53021', 'Garfield': '53023',
'Grant': '53025', 'Grays Harbor': '53027', 'Island': '53029', 'Jefferson': '53031',
'King': '53033', 'Kitsap': '53035', 'Kittitas': '53037', 'Klickitat': '53039',
'Lewis': '53041', 'Lincoln': '53043', 'Mason': '53045', 'Okanogan': '53047',
'Pacific': '53049', 'Pend Oreille': '53051', 'Pierce': '53053', 'San Juan': '53055',
'Skagit': '53057', 'Skamania': '53059', 'Snohomish': '53061', 'Spokane': '53063',
'Stevens': '53065', 'Thurston': '53067', 'Wahkiakum': '53069', 'Walla Walla': '53071',
'Whatcom': '53073', 'Whitman': '53075', 'Yakima': '53077'
}
temp['FIPS'] = temp['County'].map(county_fips)
In [33]:
temp
Out[33]:
| County | EV Count | FIPS | |
|---|---|---|---|
| 0 | Adams | 34 | 53001 |
| 1 | Asotin | 48 | 53003 |
| 2 | Benton | 1376 | 53005 |
| 3 | Chelan | 654 | 53007 |
| 4 | Clallam | 728 | 53009 |
| 5 | Clark | 6681 | 53011 |
| 6 | Columbia | 13 | 53013 |
| 7 | Cowlitz | 569 | 53015 |
| 8 | Douglas | 221 | 53017 |
| 9 | Ferry | 27 | 53019 |
| 10 | Franklin | 365 | 53021 |
| 11 | Garfield | 4 | 53023 |
| 12 | Grant | 335 | 53025 |
| 13 | Grays Harbor | 402 | 53027 |
| 14 | Island | 1298 | 53029 |
| 15 | Jefferson | 698 | 53031 |
| 16 | King | 58980 | 53033 |
| 17 | Kitsap | 3828 | 53035 |
| 18 | Kittitas | 392 | 53037 |
| 19 | Klickitat | 175 | 53039 |
| 20 | Lewis | 431 | 53041 |
| 21 | Lincoln | 30 | 53043 |
| 22 | Mason | 547 | 53045 |
| 23 | Okanogan | 149 | 53047 |
| 24 | Pacific | 145 | 53049 |
| 25 | Pend Oreille | 32 | 53051 |
| 26 | Pierce | 8525 | 53053 |
| 27 | San Juan | 717 | 53055 |
| 28 | Skagit | 1228 | 53057 |
| 29 | Skamania | 139 | 53059 |
| 30 | Snohomish | 12412 | 53061 |
| 31 | Spokane | 2785 | 53063 |
| 32 | Stevens | 91 | 53065 |
| 33 | Thurston | 4109 | 53067 |
| 34 | Wahkiakum | 39 | 53069 |
| 35 | Walla Walla | 312 | 53071 |
| 36 | Whatcom | 2839 | 53073 |
| 37 | Whitman | 177 | 53075 |
| 38 | Yakima | 617 | 53077 |
In [34]:
geojson_url = 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
fig = px.choropleth(
temp,
geojson=geojson_url,
locations='FIPS', # Use FIPS codes for matching
color='EV Count',
scope='usa',
title='Number of Electronic vehicles in Washington State County-wise',
labels={'EV Count': 'EV Count'}
)
fig.update_geos(fitbounds="locations", visible=False) # Focus on WA state
fig.show()
Create a Racing Bar Plot to display the animation of EV Make and its count each year.¶
In [35]:
import bar_chart_race as bcr
In [36]:
temp = df.groupby(["Make","Model Year"])["VIN (1-10)"].count().reset_index(name='Count')
temp = temp[(temp["Model Year"]>=2012 )&(temp["Model Year"]<=2024)] # taking the data of past 10 years
temp = temp.sort_values(by="Model Year")
In [37]:
fig = px.bar(
temp,
x='Count',
y='Make',
color='Make', # Optional: Color bars by Make
animation_frame='Model Year', # Use Year for the animation
orientation='h', # Horizontal bar plot
title='EV Make Count Over the Years',
labels={'Count': 'EV Count', 'Make': 'Vehicle Make'},
range_x=[0, temp['Count'].max() + 100] # Adjust range for better visualization
)
fig.update_layout(
xaxis=dict(showgrid=True, gridcolor='LightGray', tickformat=',d'),
yaxis_title='Make',
xaxis_title='Number of Vehicles',
showlegend=False,
title_x=0.5,
margin=dict(l=50, r=50, t=50, b=50),
width=800,
height=600
)
fig.show()
In [38]:
ev_counts = temp.pivot_table(index='Model Year', columns='Make', values='Count', fill_value=0)
In [39]:
# Pivot the DataFrame to the right format
ev_counts = temp.pivot_table(index='Model Year', columns='Make', values='Count', fill_value=0)
bar_chart_params = {
'df': ev_counts, # DataFrame for the chart
'filename': "ev_racing_bar.mp4", # Output file name
'orientation': "h", # Horizontal bar plot
'sort': "desc", # Sort bars in descending order
'n_bars': 10, # Display the top 10 bars
'steps_per_period': 50, # Steps for smoothness
'period_length': 2000, # Duration of each period in milliseconds
'title': 'EV Make Count Over the Years' # Title settings
}
# Create the racing bar plot and save it as a video
bcr.bar_chart_race(**bar_chart_params)
# Confirmation message
print("Video saved as ev_racing_bar.mp4")
C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:889: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead. C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:286: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator. C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:287: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator. C:\Users\vijay\anaconda3\Lib\site-packages\bar_chart_race\_make_chart.py:226: UserWarning: Some of your columns never make an appearance in the animation. To reduce color repetition, set `filter_column_colors` to `True`
Video saved as ev_racing_bar.mp4
In [ ]: